GitHub Repository: debakarr/machinelearning
Path: blob/master/Part 2 - Regression/Decision Tree Regression/[R] Decision Tree Regression.ipynb
Kernel: R
library("IRdisplay")

Classification And Regression Trees (CART) is a term introduced by Leo Breiman for decision tree algorithms that predict using either a classification or a regression model.


display_png(file="img/01.png")
Image in a Jupyter notebook
display_png(file="img/02.png")
Image in a Jupyter notebook
display_png(file="img/03.png")
Image in a Jupyter notebook

The algorithm splits the data into several terminal leaves, and each leaf stores the average of the dependent variable for the observations that fall into it. Above we have two independent variables and one dependent variable. Depending on the values of two new independent variables, we can predict the dependent variable more precisely than with the naive approach (where, no matter what the two new independent variables are, we would assign the average of all the points as the prediction).

For example, let's say we want to predict the dependent variable for the independent variables X1 = 30 and X2 = 100.

Then from the decision tree we can say that Y = -64.1 (since X1 < 20 => No, X2 < 170 => Yes and X1 < 40 => Yes).
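The same idea can be sketched in R on a made-up data frame with two predictors named X1 and X2 and a response Y (hypothetical data for illustration only, not the notebook's dataset):

# A minimal sketch on made-up data mirroring the two-predictor example above
library(rpart)

set.seed(42)
toy <- data.frame(X1 = runif(200, 0, 60),
                  X2 = runif(200, 0, 300))
toy$Y <- ifelse(toy$X1 < 20, 300,
         ifelse(toy$X2 >= 170, 1000,
         ifelse(toy$X1 < 40, -64.1, 150))) + rnorm(200, sd = 5)

# Each terminal leaf stores the average Y of the training points in it;
# predict() returns the average of the leaf a new point falls into.
toy_tree <- rpart(Y ~ X1 + X2, data = toy)
predict(toy_tree, data.frame(X1 = 30, X2 = 100))   # should land near -64.1 for this generated data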


Data Preprocessing

# Importing the dataset
dataset = read.csv('Position_Salaries.csv')
dataset = dataset[2:3]
dataset
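After dropping the first column, the remaining columns are Level and Salary (the same names used in the model formula below). A quick way to confirm the structure:

# Inspect the two remaining columns, Level and Salary
str(dataset)
head(dataset)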

Fitting the Decision Tree Regression Model to the dataset

# install.packages('rpart')
library(rpart)
regressor = rpart(formula = Salary ~ .,
                  data = dataset,
                  control = rpart.control(minsplit = 1))
summary(regressor)
Call:
rpart(formula = Salary ~ ., data = dataset, control = rpart.control(minsplit = 1))
  n= 10

          CP nsplit  rel error   xerror      xstd
1 0.77638626      0 1.00000000 1.234568 0.7835133
2 0.15496716      1 0.22361374 1.481711 0.7808785
3 0.05217357      2 0.06864658 1.481711 0.7808785
4 0.01000000      3 0.01647301 1.481711 0.7808785

Variable importance
Level
  100

Node number 1: 10 observations,    complexity param=0.7763863
  mean=249500, MSE=8.066225e+10
  left son=2 (8 obs) right son=3 (2 obs)
  Primary splits:
      Level < 8.5 to the left,  improve=0.7763863, (0 missing)

Node number 2: 8 observations,    complexity param=0.05217357
  mean=124375, MSE=6.921484e+09
  left son=4 (6 obs) right son=5 (2 obs)
  Primary splits:
      Level < 6.5 to the left,  improve=0.7600316, (0 missing)

Node number 3: 2 observations,    complexity param=0.1549672
  mean=750000, MSE=6.25e+10
  left son=6 (1 obs) right son=7 (1 obs)
  Primary splits:
      Level < 9.5 to the left,  improve=1, (0 missing)

Node number 4: 6 observations
  mean=82500, MSE=1.38125e+09

Node number 5: 2 observations
  mean=250000, MSE=2.5e+09

Node number 6: 1 observations
  mean=500000, MSE=0

Node number 7: 1 observations
  mean=1000000, MSE=0
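The control = rpart.control(minsplit = 1) argument matters here. By default rpart requires at least 20 observations in a node before it attempts a split, so with only 10 rows the default settings would leave a single root node and every prediction would be the overall mean. A quick sketch of the difference:

# Fit with default control for comparison: on 10 rows the default
# minsplit = 20 prevents any split, leaving only the root node.
regressor_default = rpart(formula = Salary ~ ., data = dataset)
nrow(regressor_default$frame)   # 1 row: just the root

# The minsplit = 1 fit above grows the full tree shown in the summary.
nrow(regressor$frame)           # 7 rows: 3 internal nodes + 4 leaves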

Predicting a new result

y_pred = predict(regressor, data.frame(Level = 6.5))
y_pred
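Level = 6.5 falls into the leaf covering levels from 6.5 up to 8.5 (node 5 in the summary), so the prediction is that leaf's mean, 250000. Because the model is piecewise constant, any other level in the same interval returns the same value; a small sketch (the levels below are chosen only for illustration):

# Predictions are constant within each terminal leaf, so levels that
# fall in the same interval return identical values.
predict(regressor, data.frame(Level = c(3, 6.5, 7.9, 9, 10)))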

Visualising the Regression Model results

# install.packages('ggplot2')
library(ggplot2)
ggplot() +
  geom_point(aes(x = dataset$Level, y = dataset$Salary),
             colour = 'red') +
  geom_line(aes(x = dataset$Level, y = predict(regressor, newdata = dataset)),
            colour = 'blue') +
  ggtitle('Truth or Bluff (Regression Model)') +
  xlab('Level') +
  ylab('Salary')
Image in a Jupyter notebook
# Visualising the Regression Model results (for higher resolution and smoother curve)
# install.packages('ggplot2')
library(ggplot2)
x_grid = seq(min(dataset$Level), max(dataset$Level), 0.01)
ggplot() +
  geom_point(aes(x = dataset$Level, y = dataset$Salary),
             colour = 'red') +
  geom_line(aes(x = x_grid, y = predict(regressor, newdata = data.frame(Level = x_grid))),
            colour = 'blue') +
  ggtitle('Truth or Bluff (Regression Model)') +
  xlab('Level') +
  ylab('Salary')
Image in a Jupyter notebook
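Besides plotting the fitted step function, the tree itself can be drawn. One option (assuming the rpart.plot package is available) is:

# install.packages('rpart.plot')
library(rpart.plot)

# Each leaf in the drawing shows the mean Salary for its Level interval
rpart.plot(regressor)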

From the graphs above it is clear that the model predicts an average value within each interval. The intervals are:

1 to 6.5, 6.5 to 8.5, 8.5 to 9.5, 9.5 to 10.

The predicted Salary for any Level between 6.5 and 8.5 is 250000.
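These interval boundaries (6.5, 8.5 and 9.5) are exactly the split points the tree learned, and they can be read back from the fitted object; a sketch of one way to do so:

# The 'splits' matrix of an rpart object stores the split value for each
# continuous split in its 'index' column.
sort(unname(regressor$splits[, "index"]))   # 6.5, 8.5, 9.5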